{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OAxxNGFDJhqk"
   },
   "source": [
    "# Analyze GPT using Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JM_SoupgJs-O"
   },
   "source": [
    "## Load ChatGPT\n",
    "\n",
    "The following is an example of an analysis of data collected from GPT-3.5 from Open AI. It was collected using the OpenAI python API below and can be analyzed in Phoenix. This notebook will cover:\n",
    "\n",
    "* How to import prompt/response pairs collected from GPT\n",
    "* How to load the dataset into Phoenix for analysis \n",
    "* How to collect prompt/response pairs using the openAI python SDK\n",
    "\n",
    "⚠️ Generating embeddings is very fast on GPU instances (seconds) but can take a 2-3 minute on a CPU. We recommend using a GPU runtime instance for this notebook.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations_df = pd.read_csv(\n",
    "    \"https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/dataframe_llm_gpt.csv\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def string_to_array(s):\n",
    "    numbers = re.findall(r\"[-+]?\\d*\\.\\d+|[-+]?\\d+\", s)\n",
    "    return np.array([float(num) for num in numbers])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations_df[\"prompt_vector\"] = conversations_df[\"prompt_vector\"].apply(string_to_array)\n",
    "conversations_df[\"response_vector\"] = conversations_df[\"response_vector\"].apply(string_to_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dugp6Krv3Thw"
   },
   "source": [
    "Installing Arize to make use of the embeddings generators available for use from the SDK generators package"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install arize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install 'arize[AutoEmbeddings]'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from arize.pandas.embeddings import EmbeddingGenerator, UseCases\n",
    "\n",
    "generator = EmbeddingGenerator.from_use_case(\n",
    "    use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,\n",
    "    model_name=\"distilbert-base-uncased\",\n",
    "    tokenizer_max_length=512,\n",
    "    batch_size=100,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0_NWVSp83hdK"
   },
   "source": [
    "Generate embeddings for each Prompt and Response column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Very fast on GPU (seconds) but can take a 2-3 minute on a CPU\n",
    "conversations_df = conversations_df.reset_index(drop=True)\n",
    "if not all(col in conversations_df.columns for col in [\"prompt_vector\", \"response_vector\"]):\n",
    "    conversations_df[\"prompt_vector\"] = generator.generate_embeddings(\n",
    "        text_col=conversations_df[\"prompt\"]\n",
    "    )\n",
    "    conversations_df[\"response_vector\"] = generator.generate_embeddings(\n",
    "        text_col=conversations_df[\"response\"]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6fO2154Q3sp0"
   },
   "source": [
    "**Install Phoenix**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install arize-phoenix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import phoenix as px\n",
    "\n",
    "# Define a Schema() object for Phoenix to pick up data from the correct columns for logging\n",
    "schema = px.Schema(\n",
    "    feature_column_names=[\n",
    "        \"step\",\n",
    "        \"conversation_id\",\n",
    "        \"api_call_duration\",\n",
    "        \"response_len\",\n",
    "        \"prompt_len\",\n",
    "    ],\n",
    "    prompt_column_names=px.EmbeddingColumnNames(\n",
    "        vector_column_name=\"prompt_vector\", raw_data_column_name=\"prompt\"\n",
    "    ),\n",
    "    response_column_names=px.EmbeddingColumnNames(\n",
    "        vector_column_name=\"response_vector\", raw_data_column_name=\"response\"\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the dataset from the conversation dataframe & schema\n",
    "conv_ds = px.Dataset(conversations_df, schema, \"production\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Click the link below to open in a view in Phoenix of ChatGPT data\n",
    "px.launch_app(conv_ds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ee96QCiT44Qf"
   },
   "source": [
    "**Collecting GPT Prompt & Response Data**\n",
    "\n",
    "In order to analyze data in Phoenix the dataframe in the pervious section was collected using the code below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"openai>=1\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
    "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
    "    os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import uuid\n",
    "\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI()\n",
    "\n",
    "messages = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"This is the ChatGPT interface. Type in prompt!\")\n",
    "print(\"the prompt/response data will be concatenated with conversations_df dataframe.\")\n",
    "print(\"This cell can be run many times to concatenate more data.\")\n",
    "print(\"To Exit type: CTRL-D\")\n",
    "\n",
    "\n",
    "def count_words(text):\n",
    "    words = text.split()\n",
    "    return len(words)\n",
    "\n",
    "\n",
    "try:\n",
    "    data = {\n",
    "        \"prompt\": [],\n",
    "        \"response\": [],\n",
    "        \"step\": [],\n",
    "        \"conversation_id\": [],\n",
    "        \"prediction_id\": [],\n",
    "        \"api_call_duration\": [],\n",
    "        \"prompt_len\": [],\n",
    "        \"response_len\": [],\n",
    "    }\n",
    "    # This represents a single string of a conversation\n",
    "    conversation_id = uuid.uuid4().hex[:4]\n",
    "    step = 0\n",
    "    while True:\n",
    "        message = input(\"Prompt : \")\n",
    "        start_time = time.time()\n",
    "        if message:\n",
    "            messages.append(\n",
    "                {\"role\": \"user\", \"content\": message},\n",
    "            )\n",
    "            chat = client.chat.completions.create(model=\"gpt-3.5-turbo\", messages=messages)\n",
    "        end_time = time.time()\n",
    "        reply = chat.choices[0].message.content\n",
    "        data[\"prediction_id\"].append(str(uuid.uuid4())[:20])\n",
    "        data[\"prompt\"].append(message)\n",
    "        data[\"response\"].append(reply)\n",
    "        data[\"step\"].append(step)\n",
    "        data[\"conversation_id\"].append(conversation_id)\n",
    "        data[\"api_call_duration\"].append(end_time - start_time)\n",
    "        data[\"prompt_len\"].append(\n",
    "            count_words(message)\n",
    "        )  # Words / not tokens, but just a simple example\n",
    "        data[\"response_len\"].append(\n",
    "            count_words(reply)\n",
    "        )  # Words / not tokens, but just a simple example\n",
    "        print(str(end_time - start_time))\n",
    "        step += 1\n",
    "        print(f\"ChatGPT Response: {reply}\")\n",
    "        messages.append({\"role\": \"assistant\", \"content\": reply})\n",
    "except Exception as e:\n",
    "    print(\"Exiting Chat\")\n",
    "    print(str(e))\n",
    "    df = pd.DataFrame(data)\n",
    "    conversations_df = pd.concat([conversations_df, df])\n",
    "    messages = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations_df = conversations_df.reset_index(drop=True)\n",
    "conversations_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "e62u0uYdAmUS"
   },
   "source": [
    "The example above is just for test purposes and application specific integrations will look different. "
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
